1  Introduction 

This paper presents the design of Proteus, a simulator for MIMD multiprocessors. Proteus is an
execution-driven simulator [CMM+88]; it multiplexes a single processor among the various activities
in a simulated parallel machine to provide accurate information about the timing and behavior of an
application and the underlying simulated architecture. Proteus is fast, accurate, and flexible: it is
one to two orders of magnitude faster than comparable simulators, it can reproduce results from real
multiprocessors, and it is easily configured to simulate a wide range of MIMD architectures. Proteus’
modular structure allows users to trade off accuracy and performance: users pay for accuracy only when
and where they need it. The structure also allows easy customization of the architecture. Finally,
Proteus provides repeatability and nonintrusive monitoring and debugging, which result in a development
environment superior to those available on real multiprocessors.

We  believe  that  simulation  has  a  valuable  role  to  play  at  all  levels  of  the  design  and  analysis  of 
multiprocessor  systems,  from  architectures  to  runtime  systems  to  algorithms  and  applications.  Many 
projects  have  used  simulation  during  the  development  of  new  architectures  to  guide  the  design.  We 
believe  that  simulation  has  an  equally  vital  role  to  play  in  the  development  of  software  systems  for 
multiprocessors. 

There are two alternatives to simulation: analytical modeling and using real machines. Multiprocessor
systems are sufficiently complex that analytical modeling is difficult. On the other hand, using a real
machine to test, debug, and tune a program is problematic. In contrast, simulation allows nonintrusive
monitoring and debugging, and also makes it easy to repeat executions so that different phenomena in
an execution can be studied at a variety of levels of detail.1 Another important advantage of simulation
is flexibility. Using a simulator such as Proteus, we can study the behavior of a program on many
different architectures. For example, alternative memory systems can be simulated, giving insight into the
interactions among applications, compilers, and cache-management techniques. Similarly, the number
of processors can be varied, giving insight into the scalability of a program or algorithm (perhaps well
beyond the limits imposed by real machines).

For all its advantages, simulation has potential problems in two areas, speed and accuracy, that can
make it less useful. First, simulators are often slow, making it impossible to run large experiments or sets
of experiments. Second, simulators are often inaccurate, making it difficult to draw useful conclusions
from the results of a simulation. Proteus is an execution-driven simulator that interleaves the execution
of an application program with the simulation of the underlying architecture; this makes it possible to
achieve very high accuracy. In addition, Proteus avoids interpreting user application code whenever
possible, thus removing the overhead of interpretation for most instructions. Proteus is also designed
so that the entire simulation system, including the application program and the network and memory
simulators, runs in a single address space. These and other factors discussed in Section 6 result in a
performance improvement of one to two orders of magnitude when compared to other simulators with
comparable flexibility such as Tango [DGH91].

1 Some parallel debuggers support repeatability (e.g., Instant Replay [LM86]), but at the cost of
maintaining huge trace files and of introducing a significant probe effect [Gai86].

Another important feature of Proteus is that it lets the user control the level of accuracy of the
simulation. In general, there is a tradeoff between accuracy and performance: a more
accurate simulation requires more time. Since the level of accuracy desired and the amount of information
needed from a simulation depend on the application, Proteus provides users with unprecedented
flexibility in choosing or customizing the level of accuracy in the network and memory simulations. The
user can also control what monitoring data is produced, both for system-level data (e.g., shared-memory
traces) and user-level data (e.g., the time spent in a code section, or the size of a data structure). As
discussed in more detail below, changing the level of accuracy of the simulation makes a large difference
in the running time. For users who need large simulations or sets of simulations, it is important that
they be able to pay only for the accuracy they need.

Proteus was originally designed for evaluating language, compiler, and runtime system mechanisms
to  support  portability;  thus,  flexibility,  accuracy,  and  performance  are  all  important.  We  have  also  used 
it  for  algorithmic  and  architectural  studies,  including  concurrent  search  trees  and  network  and  cache 
research  [CBDW91].  In  general,  PROTEUS  is  an  excellent  development  platform  for  parallel  software:  it 
supports  testing  and  debugging,  performance  evaluation  and  tuning,  and  graphical  output. 

Section  2  provides  an  overview  of  the  simulator,  Section  3  discusses  Proteus’  modular  structure,  and 
Section  4  describes  the  use  of  direct  execution  and  augmentation.  Support  for  debugging,  monitoring 
and  graphics  is  discussed  in  Section  5,  while  Section  6  evaluates  overall  system  performance.  Section  7 
presents  evidence  on  the  accuracy  of  Proteus:  it  compares  simulation  results  to  published  empirical 
data  from  an  nCUBE  multiprocessor.  Finally,  Section  8  describes  related  work  and  Section  9  presents 
our  conclusions. 


2  Overview 

Proteus is not actually a simulator; rather, it is a simulation engine that combines with
architecture-specific modules and user applications to create a simulator. The resulting executable provides
high-performance simulation of the user’s application on the target architecture. This section presents a brief
overview of Proteus, including the basic multiprocessor model, the programming language, and the
steps involved in building and using Proteus simulators.

PROTEUS  simulates  MIMD  multiprocessors  in  which  independent  processor  nodes  are  connected  via 
an interconnection medium. The interconnection medium can be either a bus, a direct network such as
a k-ary n-cube, or an indirect network such as a butterfly. Each processor node consists of a processor,
a  network  chip,  a  cache  chip,  and  memory.  Conceptually,  the  processor  is  a  generic  sequential  processor 
extended  with  instructions  for  network  access  and  cache  coherence.  The  network  chip  interfaces  the 
processor  with  the  interconnection  medium.  The  cache  chip,  which  is  optional,  handles  cache  coherence 
and  works  with  the  network  chip  for  remote  memory  accesses. 

The  memory  at  each  node  is  divided  into  two  sections,  a  shared  section  that  maps  to  part  of  a 
global  address  space,  and  a  private  section  that  is  not  accessible  from  the  interconnection  medium.  For 
distributed-memory  machines,  the  size  of  the  shared  section  is  zero.  Proteus  can  simulate  hardware 
cache  coherence  for  global  memory  and  provides  primitives  for  software  coherence. 

Users write applications in a superset of C. The extensions include keywords for declaring that data
reside in shared memory and for controlling the placement of data structures. Proteus provides library
routines  for  message  passing,  thread  management,  memory  management,  and  data  collection. 

There  are  four  steps  in  the  creation  and  use  of  a  Proteus  simulator.  First,  the  user  specifies  the 
architecture  using  an  X-based  configuration  tool.  Second,  the  application-  and  architecture-specific 
simulator  is  compiled  and  linked  into  an  executable.  Next,  the  user  runs  the  executable  to  produce 
screen  output  and  a  trace  file.  Finally,  PROTEUS  includes  a  sophisticated  X-based  graph  generator, 
discussed in Section 5.4, that interprets the trace file and presents the results of the simulation.2

3  Modules 

Proteus  was  designed  with  a  modular  structure  to  simplify  replacement  and  customization  of  specific 
parts  of  the  simulator.  The  modular  structure  provides  two  very  important  abilities.  First,  the  structure 
simplifies  customizing  the  target  architecture:  it  is  very  easy  to  experiment  with  part  of  the  architecture 
while  keeping  the  rest  unchanged.  This  makes  Proteus  useful  for  evaluating  architectural  design 
decisions,  and  for  simulating  specific  multiprocessors.  Second,  the  modular  structure  promotes  multiple 
implementations  of  a  given  module,  which  allows  users  to  switch  between  very  accurate  versions  and  very 
fast  versions.  Users  pay  only  for  what  they  need;  in  particular,  the  high-performance  versions  greatly 
reduce  development  time.  This  section  describes  the  four  most  important  modules,  uses  the  network 
module to demonstrate the effectiveness of the structure, and discusses the use of modules to trade off
accuracy  and  performance. 

The operating system module provides a kernel operating system for the simulated multiprocessor.
The kernel interface specifies procedures for thread scheduling and management, memory management,
and interrupt and trap handling. In addition to the kernel interrupt handlers, users may define their own
interprocessor interrupts (IPIs) and handlers; for example, user-defined IPIs are used to build dispatch
routines for message-passing architectures.

2 All of the graphs in this paper were produced by Proteus’ graph generator.

The  shared-memory  module  provides  access  to  local  shared  memory,  handles  full-empty  bits  [Smi81], 
and  provides  atomic  operations  such  as  test-and-set  and  compare-and-swap.  The  shared  memory  of  a 
remote processor is not accessed directly via the shared-memory module; instead, a network request is
generated  (usually  by  the  cache  module)  that  invokes  the  shared-memory  module  when  it  arrives  at  the 
remote  node.  Separating  the  remote  access  into  a  network  portion  and  a  local-memory  portion  allows 
the  network  and  shared-memory  modules  to  be  replaced  independently. 

The  cache  module  handles  memory  requests  from  the  local  processor  and  from  the  local  network 
chip.  It  generates  calls  to  both  the  shared-memory  module  (for  local  accesses)  and  the  network  module 
(for remote accesses). The primary operations provided by the cache module are read, write, and flush.
In addition, the module defines operations for software coherence: soft read and write, and fence [SS87].
The  intent  of  the  soft  operations  is  to  access  the  currently  cached,  possibly  stale,  data.  The  fence 
operation  blocks  until  all  pending  protocol  transactions  for  the  given  cache  line  have  completed  and  is 
used  to  ensure  coherence  for  that  cache  line. 

The  network  module,  described  in  detail  in  the  next  subsection,  simulates  the  movement  of  data 
within  the  interconnection  medium. 

3.1  The  Network  Module  Interface 

The  network  module  is  a  good  example  of  the  modular  structure  of  Proteus.  It  demonstrates  the  two 
key advantages of Proteus’ modular structure: the simplicity of customization and the use of multiple
versions to provide a range of accuracy and performance. The user must modify only three procedures to
replace  the  network  module.  The  multiple  versions,  which  are  discussed  in  Section  3.2,  provide  orders  of 
magnitude  performance  differences  depending  on  the  required  accuracy.  Before  discussing  the  network 
module,  a  brief  discussion  of  the  simulator  engine  is  in  order. 

Instructions that affect remote nodes are implemented using simulator requests, which are
timestamped structures stored in a central priority queue. Such a non-local instruction generates a simulator
request and inserts it into the priority queue, which is sorted by timestamp. The engine repeatedly
executes the request with the lowest timestamp until there are no requests left, at which time the simulation
is complete. Each request type has an associated procedure; the engine executes a request by calling the
associated procedure.
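
The engine loop described above can be illustrated with a minimal Python sketch. This is not the Proteus implementation: the class name `Engine`, the methods `register`, `post`, and `run`, and the request kinds used in the example are all hypothetical, chosen only to show a timestamp-ordered priority queue whose entries are dispatched to an associated procedure.

```python
import heapq
import itertools


class Engine:
    """Toy simulation engine: a priority queue of timestamped requests."""

    def __init__(self):
        self._queue = []                 # heap of (time, seq, kind, payload)
        self._seq = itertools.count()    # tie-breaker: equal timestamps run in posting order
        self._handlers = {}              # request kind -> associated procedure
        self.now = 0

    def register(self, kind, handler):
        self._handlers[kind] = handler

    def post(self, time, kind, payload=None):
        heapq.heappush(self._queue, (time, next(self._seq), kind, payload))

    def run(self):
        # Repeatedly execute the request with the lowest timestamp until none remain.
        while self._queue:
            time, _, kind, payload = heapq.heappop(self._queue)
            self.now = time
            self._handlers[kind](self, time, payload)


log = []
engine = Engine()
engine.register("hello", lambda eng, t, p: log.append((t, p)))
# A handler may post further requests, much as a send request posts a route request.
engine.register("chain", lambda eng, t, p: eng.post(t + 5, "hello", "from chain"))
engine.post(30, "hello", "late")
engine.post(10, "chain")
engine.post(20, "hello", "middle")
engine.run()
print(log)
```

Requests are executed in timestamp order regardless of the order in which they were posted, and the simulation ends when the queue is empty.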

The network module uses three types of requests. The first is a send request, which signifies that the
processor is ready to send a packet to the network chip. The second type of request is the route request,
which computes the next node for a packet and computes the arrival time of the packet at that node.
Some versions, such as a bus, do not use this request at all. The third type is the receive request, which
occurs when the packet reaches the target node. The receive request either interrupts the processor or
notifies the cache chip depending on the packet. Only the network module generates route and receive
requests; all other modules generate only send requests. Table 1 lists the procedures for generating and
executing network requests. New versions of the network module only need to replace the procedures
for executing network requests.


Request Generation
    void Send(from, to, time, packet, mode)
    void Route(next, time, packet)
    void Receive(from, time, packet)

Request Execution
    void send_request_handler(SimRequest)
    void route_request_handler(SimRequest)
    void receive_request_handler(SimRequest)

Table 1: Interface for the network module.

Typically,  the  send  request  generates  two  requests:  one  to  resume  the  processor  at  the  appropriate 
time  and  a  route  request  to  move  the  packet  to  the  next  node  or  switch.  If  the  network  chip  uses 
DMA  to  get  the  packet,  then  the  processor  is  resumed  fairly  quickly.  Other  architectures,  such  as  the 
J-machine  [D+89],  require  that  the  processor  feed  the  packet  to  the  network  chip  word  by  word.  In  this 
case,  the  delay  depends  on  the  length  of  the  packet.  The  mode  argument  is  used  to  pass  flags  to  the 
module.  At  the  moment,  the  only  flag  determines  whether  or  not  to  interrupt  the  processor  when  the 
DMA  completes  (assuming  the  network  chip  uses  DMA). 

The  route  request  computes  the  node  to  which  the  packet  should  be  forwarded.  For  example,  in 
a k-ary n-cube the route request determines which output channel to use, based on the target node,
the  incoming  channel,  and  possibly  the  contention  on  the  output  channels  of  the  current  node.  It  then 
computes  the  arrival  time  of  the  packet  at  the  next  node,  using  the  current  time  and  information  about 
when  the  channel  will  be  available.  Only  the  route  handler  needs  to  know  anything  about  channels  and 
contention. If the next node is the target, the route handler generates a receive request.
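
As an illustration of the route request, the following Python sketch models a unidirectional ring (a k-ary 1-cube) in which each route request picks the single output channel, waits until that channel is available, and computes the arrival time at the next node. The ring size, the per-hop cost, the store-and-forward timing, and all names (`simulate`, `channel_free`, and so on) are assumptions made for this example; they are not taken from Proteus.

```python
import heapq

K = 8            # nodes in the ring (assumed)
HOP_CYCLES = 2   # cycles to cross one channel (assumed)


def simulate(packets):
    """packets: list of (send_time, source, target) with source != target.

    Returns {packet index: arrival time at the target}.
    """
    channel_free = [0] * K   # channel_free[n]: when channel n -> n+1 is next available
    arrivals = {}
    queue = [(t, pid, "route", src) for pid, (t, src, _) in enumerate(packets)]
    heapq.heapify(queue)
    while queue:
        time, pid, kind, node = heapq.heappop(queue)
        if kind == "receive":
            arrivals[pid] = time
            continue
        # Route request: choose the output channel, then compute the arrival time
        # from the current time and the time at which the channel becomes available.
        next_node = (node + 1) % K
        start = max(time, channel_free[node])
        arrival = start + HOP_CYCLES
        channel_free[node] = arrival
        target = packets[pid][2]
        kind = "receive" if next_node == target else "route"
        heapq.heappush(queue, (arrival, pid, kind, next_node))
    return arrivals


result = simulate([(0, 0, 2), (1, 1, 2)])
print(result)
```

In this example the packet from node 0 reaches node 1 while the channel from node 1 to node 2 is still occupied by the other packet, so it is delayed by one cycle. Only the route step knows about channels and contention, which mirrors the separation described in the text.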

The  receive  request  looks  at  the  type  of  the  packet,  which  is  either  a  memory  packet  or  an  IPI  packet. 
Memory  packets  are  handed  to  the  cache  module,  which  defines  a  procedure  specifically  for  handling 
network  packets.  An  IPI  packet  causes  an  interrupt  of  the  local  processor. 

For a specific architecture, it is common to provide additional procedures in the network module
that improve the accuracy of the module. For example, the network chip for the Alewife multiprocessor
provides a way to check if the chip is busy. We added a procedure that returns true if the channel is
busy; we set its cost to four cycles, which is the time it would take to load and check the busy flag.

Using this structure, most network changes, including routing algorithm and topology changes,
require modifications to only the route request handler. Most detailed network modules are only a few
hundred lines total, and often much of the code can be inherited from existing network modules. The
nCUBE network module used for the experiments described in Section 7 took less than a day to
implement.


3.2  Trading  Accuracy  for  Performance 

Depending  on  the  end  goals  of  a  simulation,  some  modules  may  have  to  be  very  accurate  while  others 
can be less accurate. For example, users studying scheduling require very accurate costs in the operating
system module but may not need detailed network simulation. Furthermore, during development,
users  generally  prefer  to  avoid  the  lower  performance  of  the  most  accurate  modules.  The  ability  to 
replace  modules  provides  a  simple  way  to  trade  accuracy  for  performance:  Proteus  provides  both  a 
very  accurate  version  of  a  module  and  a  high-performance  version  with  the  same  semantics  but  lower 
accuracy.3  Currently,  the  network  module  and  the  cache  module  exploit  this  tradeoff. 

The accurate version of the k-ary n-cube network module simulates the progress of each packet hop by
hop.  This  allows  complete  simulation  of  network  contention,  including  hot  spots.  It  correctly  simulates 
uni-  and  bidirectional  edges,  end-around  connections,  internal  switch  delays,  and  virtual  channels  [DS87]. 

The high-performance version uses an analytical model developed by Agarwal [Aga91]. Instead of
simulating each hop, it computes the arrival time at the target using a formula presented in that paper and
a contention factor based on a sliding-window view of recent network traffic. This version is acceptable
when  the  traffic  is  mild.  Although  the  high-performance  version  has  limited  accuracy,  it  is  more  than 
an  order  of  magnitude  faster  than  the  exact  version. 

The analytical model used in the high-performance module produces incorrect arrival times both
when there are hot spots and when there is no contention at all. As an example of the latter, consider
a  pipeline  application  that  has  high  network  traffic  but  no  contention.  The  high  traffic  leads  to  a  high 
contention  factor,  even  though  none  of  the  packets  contend  for  an  edge.  Thus  the  model-based  version 
artificially  inflates  network  delays  when  there  is  no  contention.4 

The accurate cache module simulates Chaiken’s cache-coherence protocol for direct networks [Cha90].
It simulates all of the cache states and protocol packets. The less accurate module simply provides
coherent shared memory by not caching at all; it always goes over the network for remote memory
accesses. Although this increases network traffic, the overall system performance improves substantially.
A third version runs even faster: it accesses global memory directly, that is, without using the network.5
It assigns all global memory accesses a single fixed cost. Note that all three versions have the same
semantics; the only difference is the cost of accesses.

3 Versions with intermediate performance and accuracy are possible; the cache module currently provides
three versions.
4 Although easy to see in hindsight, the inaccuracy at zero contention was first noticed in Proteus
simulations; it was a surprise even to the author of the model.


                   Analytical Network Model    Hop-by-hop Network
Uniform Cost              1,500,000                  700,000
No Caching                1,000,000                  400,000
Coherent Cache              500,000                  120,000

Numbers are in simulated cycles per second.

Table 2: This table shows the relative system performance of the six combinations of network and cache
modules. The numbers are for the 8-queens application running on an 8x8 mesh. The simulations
were run on a DECstation 5000. These numbers vary quite a bit depending on the application and the
architecture, but the relative magnitudes are typical.

Table  2  shows  the  relative  system  performance  of  the  six  combinations  of  cache  and  network  modules 
for an 8-queens application running on an 8x8 mesh. There is more than a ten-fold difference in performance
between the least and most accurate combinations. Most simulations achieve well over one million
simulated  cycles  per  second,  since  the  accuracy  is  usually  not  needed  during  application  development. 

In summary, the modular structure of Proteus allows easy replacement and customization of individual
parts of the simulator. This allows users to tailor Proteus to a particular architecture. We have
exploited this ability to reproduce both the nCUBE [FJL+88], a message-passing multiprocessor, and
Alewife [A+91], a shared-memory multiprocessor. (Section 7 describes the correspondence between the
nCUBE version of Proteus and the real nCUBE.) The modular structure also allows selection of modules
based on required accuracy, which allows users to maximize performance for a particular simulation
by trading unneeded accuracy for increased performance. In particular, users can exploit more than a
ten-fold gain in performance during development by forfeiting detailed simulation of the network and
cache. Later, when their code is debugged, they can switch to more accurate modules without modifying
their code.


5 This is possible because Proteus runs in a single address space.
4  Direct  Execution

A  primary  factor  in  the  performance  of  Proteus  is  the  use  of  direct  execution  to  provide  very  low- 
overhead  simulation  of  most  instructions.  The  key  idea  is  to  execute  local  instructions  directly  and 
augment the code with cycle-counting instructions to time the code. This section presents an overview
of  direct  execution  with  augmentation  and  discusses  the  flexibility  it  provides  and  the  assumptions  it 
requires. 

Proteus directly executes local instructions. An instruction is local if it only affects the local
processor. For example, all register-to-register instructions are local instructions. An instruction that
might  affect  another  part  of  the  system  is  a  non-local  instruction.  All  shared-memory  accesses  and 
network  instructions  are  non-local.  Proteus  simulates  local  instructions  by  directly  executing  the 
instruction  on  the  host  workstation;  non-local  instructions  are  simulated  via  a  procedure  call. 

Although  direct  execution  provides  the  correct  functionality  of  local  instructions,  it  ignores  the 
simulated  time  required  to  execute  them.  PROTEUS  uses  code  augmentation  to  count  the  cycles  required 
by  local  instructions.  For  each  basic  block  of  local  instructions,  code  is  added  to  increment  a  global  cycle 
counter  by  the  number  of  cycles  required  to  execute  that  block.  Because  the  counter  is  incremented 
every  time  a  block  executes,  the  counter  correctly  tracks  the  required  cycles  for  any  path  through  the 
control-flow  graph. 
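
The effect of augmentation can be illustrated with a small hand-augmented Python function. This is only a sketch of the idea: Proteus augments compiled code, whereas here the increments are written by hand, and the block boundaries, the per-block costs, and the names `CycleCounter`, `counter`, and `sum_to` are assumptions made for the example.

```python
class CycleCounter:
    """Stands in for the global cycle counter of the simulated processor."""

    def __init__(self):
        self.cycles = 0

    def add(self, n):
        self.cycles += n


counter = CycleCounter()


def sum_to(n):
    # Block B0 (entry): assumed cost 3 cycles.
    counter.add(3)
    total = 0
    i = 1
    while True:
        # Block B1 (loop test): assumed cost 2 cycles.
        counter.add(2)
        if i > n:
            break
        # Block B2 (loop body): assumed cost 4 cycles.
        counter.add(4)
        total += i
        i += 1
    # Block B3 (exit): assumed cost 1 cycle.
    counter.add(1)
    return total


result = sum_to(10)
print(result, counter.cycles)
```

For `sum_to(10)` the loop test runs 11 times and the body 10 times, so the counter ends at 3 + 11*2 + 10*4 + 1 = 66 cycles. The count follows whichever path is actually taken, because each block adds its own cost every time it executes.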

Direct execution with augmentation was first used by Mathieson and Francis [MF88] and
by Covington et al. [CMM+88]. The technique has been used in several other simulators [DGH91, Che89,
SF89]. We extend the work in this area in three ways. First, Proteus provides support for nonintrusive
monitoring,  which  is  discussed  in  Section  5.1. 

Second,  profiling  information,  similar  to  the  Unix  tool  prof  [DECb],  can  be  generated  by  using  a 
procedure-specific  cycle  counter  in  addition  to  the  global  cycle  counter.  This  produces  very  accurate 
counts  of  the  simulated  cycles  spent  in  each  procedure.  As  with  prof,  the  profiling  information  guides 
tuning and aids debugging. Unlike prof, which uses periodic sampling to collect profiling data, Proteus
profiling  data  is  exact. 

Program                  Normal Cycles    Augmented Cycles    Overhead Factor
queue                       65,876,631         144,581,647                2.2
Sieve                       52,483,384         130,868,590                2.5
augment                     11,670,316          24,578,648                2.1
Minimum ASIM overhead                                                     200

Table 3: Measuring the overhead of augmentation. This table compares several sequential programs
with and without augmentation. The cycles were determined by pixie [DECa], a profiling tool available
on MIPS-based workstations. The overhead factor is the ratio of the pixie cycle count for the
augmented version over that of the normal version. The overhead is consistently a small factor. ASIM is a
multiprocessor simulator developed for the Alewife project at MIT [A+91, CLN90]; it is representative
of instruction-interpreting simulators.


Third, we use augmentation to limit the number of cycles a single thread can execute without
returning control to the simulator engine. This limit, called the quantum, keeps processors close together
in simulated time. Normally, processors are kept close together simply because they perform non-local
instructions, which always return control to the engine. However, without the quantum, loops
containing only local instructions can cause a thread to get thousands of cycles ahead. This affects
arriving interrupts, which may get artificially delayed thousands of cycles. The quantum also prevents
infinite loops in user code from hindering debugging: since the simulator regularly regains control, the
user can enter debugging mode and easily determine which processors and which procedures are in
infinite loops.
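
The role of the quantum, the limit on how many cycles a thread may run before returning control to the engine, can be shown with a toy Python model. Threads are represented as generators that yield the cost of each local basic block; the round-robin scheduling, the function names `spin` and `run`, and the numeric values are assumptions of this sketch rather than details of Proteus.

```python
def spin(block_cost, blocks):
    """A thread consisting only of local basic blocks; yields each block's cycle cost."""
    for _ in range(blocks):
        yield block_cost


def run(threads, quantum):
    """Run threads round-robin, switching once a thread has used up its quantum.

    Returns (final per-thread clocks, largest clock gap seen at any switch).
    """
    clocks = [0] * len(threads)
    live = list(range(len(threads)))
    max_skew = 0
    while live:
        for tid in list(live):
            used = 0
            while used < quantum:
                cost = next(threads[tid], None)
                if cost is None:          # thread finished
                    live.remove(tid)
                    break
                used += cost              # cycles charged by the augmented code
                clocks[tid] += cost
            # Control is back in the engine here.
            max_skew = max(max_skew, max(clocks) - min(clocks))
    return clocks, max_skew


bounded = run([spin(10, 1000), spin(10, 1000)], quantum=100)
unbounded = run([spin(10, 1000), spin(10, 1000)], quantum=10**9)
print(bounded, unbounded)
```

With a quantum of 100 cycles the two simulated processors never drift more than 100 cycles apart. With an effectively unlimited quantum, the first thread runs its entire 10,000-cycle loop before the second thread starts, which is the kind of skew that would delay an arriving interrupt.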

The simulation overhead incurred by code augmentation is much lower than that incurred by instruction
interpretation, which is used in most processor simulators. Table 3 shows the overhead due
to augmentation for three sequential programs. As discussed by Davis et al. [DGH91], the overhead
for  augmentation  is  about  a  factor  of  two,  which  is  about  one  hundred  times  lower  than  the  overhead 
for  instruction  interpretation.  Unfortunately,  these  numbers  only  apply  for  local  instructions;  non-local 
instructions  must  still  be  interpreted.  Thus  the  overall  performance  of  Proteus,  which  is  discussed  in 
Section  6,  is  rarely  one  hundred  times  faster  than  instruction-interpreting  simulators. 
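
The overhead factors can be recomputed directly from the cycle counts in Table 3. The short Python check below uses only the numbers from the table; the names `TABLE3`, `ASIM_MIN_OVERHEAD`, and `overhead_factors` are introduced here for illustration.

```python
# (normal cycles, augmented cycles) as reported in Table 3
TABLE3 = {
    "queue":   (65_876_631, 144_581_647),
    "Sieve":   (52_483_384, 130_868_590),
    "augment": (11_670_316, 24_578_648),
}
ASIM_MIN_OVERHEAD = 200   # minimum ASIM overhead from Table 3


def overhead_factors(table):
    """Overhead factor = augmented cycle count / normal cycle count."""
    return {name: augmented / normal for name, (normal, augmented) in table.items()}


factors = overhead_factors(TABLE3)
for name, factor in factors.items():
    ratio = ASIM_MIN_OVERHEAD / factor
    print(f"{name:8s} overhead {factor:.1f}x, interpretation/augmentation ratio {ratio:.0f}")
```

The factors round to 2.2, 2.5, and 2.1, matching the table, and dividing the minimum ASIM overhead by them gives ratios of roughly 80 to 95, consistent with the "about one hundred times" figure quoted for local instructions.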

The hundred-fold performance gain for local instructions does not come for free. Using direct execution
with augmentation requires several assumptions that are not required by simulators that interpret
every  instruction.  First,  because  Proteus  determines  the  cost  of  each  basic  block  at  compile  time,  the 
cost  of  a  block  is  a  fixed  number  of  cycles.  In  reality,  the  cost  of  an  instruction  depends  on  cache  hits 
and  sometimes  on  the  operands.  Thus,  we  use  the  expected  cost  of  the  instruction,  taking  into  account 
both  the  expected  number  of  cycles  for  the  instruction  and  the  expected  delay  due  to  cache  misses.  In 
essence, we assume uniform cache hit rates for instructions and data in private memory. (Shared-memory
accesses  are  simulated  in  detail  and  thus  avoid  this  assumption.)  This  assumption  is  reasonable  because 
uniprocessor  cache  hit  rates  are  very  high,  and  because  small  periodic  errors  in  instruction  costs  rarely 
affect  overall  simulation  results. 
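
One way to read the expected-cost assumption is as a simple formula: the cost of an instruction is its base cycle count plus the expected stall from cache misses on its private-memory references, and the cost of a basic block is the sum over its instructions, fixed at compile time. The Python sketch below works through that arithmetic. The hit rate, the miss penalty, the one-fetch-per-instruction model, and the function names are assumptions for illustration, not values or code from Proteus.

```python
def expected_cost(base_cycles, memory_refs, hit_rate, miss_penalty):
    """Expected cycles for one instruction: base cost plus expected miss stalls."""
    return base_cycles + memory_refs * (1.0 - hit_rate) * miss_penalty


def block_cost(instructions, hit_rate=0.95, miss_penalty=20):
    """Fixed cost charged each time a basic block executes.

    instructions: list of (base_cycles, data_refs); every instruction also makes
    one instruction fetch from private memory.
    """
    total = sum(
        expected_cost(base, 1 + data_refs, hit_rate, miss_penalty)
        for base, data_refs in instructions
    )
    return round(total)   # the augmented code adds a whole number of cycles


block = [(1, 0), (1, 1), (2, 0)]   # e.g. an add, a load, and a two-cycle operation
print(block_cost(block))
print(block_cost(block, hit_rate=1.0))
```

With a 95% hit rate and a 20-cycle miss penalty, each memory reference adds one expected cycle, so this block costs 4 base cycles plus 4 expected stall cycles. Because the same fixed cost is charged on every execution, individual executions may be slightly over- or under-charged, which is the small periodic error the text refers to.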

A  second  and  related  assumption  is  that  code  and  stacks  reside  in  private  memory.  If  code  resides  in 
shared  memory,  Proteus  must  simulate  the  cache-coherence  protocol  for  every  instruction  fetch,  which 
removes  most  of  the  performance  benefit  of  direct  execution.  Likewise,  if  stacks  reside  in  shared  memory, 
every stack access must be simulated in detail, which again results in a severe loss of performance.
Section  8  discusses  future  plans  regarding  this  assumption. 

The errors due to these assumptions are small and localized; in practice, they have had negligible
effect. Section 7 compares Proteus results with those of real multiprocessors; for these applications,
our  assumptions  are  validated. 

5  Monitoring  and  Debugging 

In  addition  to  performance,  a  primary  asset  of  Proteus  is  its  support  for  monitoring  and  debugging. 
Proteus  provides  nonintrusive  monitoring  and  debugging:  users  can  add  monitoring  code  that  does 
not  affect  the  behavior  or  timing  of  the  simulation.  Proteus  also  provides  repeatability:  users  can 
rerun  simulations  to  pinpoint  bugs.  Real  multiprocessors  generally  provide  neither  of  these  abilities. 

Because  Proteus  runs  as  a  single  process,  it  works  well  with  sequential  debuggers  such  as  dbx  [Lin90]. 
This extends the power of advanced sequential debuggers to the parallel development arena. Furthermore,
Proteus provides an internal debugging mode that allows users to examine the states of threads,
processors,  locks,  and  memory.  Combining  the  Proteus  debugger  with  a  sequential  debugger  such  as 
dbx  results  in  a  very  effective  development  environment. 

Proteus also provides an integrated subsystem for data collection and display. Data collection is
supported  by  primitives  for  recording  data  to  a  trace  file  and  by  user-defined  data  types.  Data  display 
is  performed  by  an  X-based  graph  program  that  uses  a  simple  but  powerful  graph  language  to  interpret 
the  trace  file  data. 

This section examines Proteus’ support for nonintrusive monitoring and discusses repeatability and
nondeterminism.  It  then  examines  the  primitives  for  data  collection  and  concludes  with  a  discussion  of 
the  graph-generation  program. 

5.1  Nonintrusive  Monitoring 

Nonintrusive  monitoring,  combined  with  repeatability,  greatly  simplifies  the  development  of  concurrent 
programs. Real multiprocessor systems suffer from the probe effect: the addition of monitoring code may
cause the monitored effect to disappear [Gai86]. This prevents programmers from collecting additional
data  for  debugging.  Proteus  allows  users  to  add  arbitrary  monitoring  or  debugging  code  without 
changing  the  behavior  of  the  simulation. 

For  non-cycle-counted  code,  the  addition  is  trivial.  Since  the  cost  of  the  code  is  not  determined 
by cycle counting, the monitoring code does not affect the cost, which ensures no change in behavior.6
Thus for engine code and most architectural modules, the addition of nonintrusive monitoring code is
straightforward.

6 The monitoring code may alter costs if desired, but this is unusual since it could change the system behavior.

Adding  nonintrusive  code  to  cycle-counted  code  can  be  more  difficult.  In  this  case,  a  simple  addition 
will  change  the  behavior  since  the  cost  of  the  code  increases.  To  resolve  this  problem,  Proteus  allows 
users  to  turn  off  cycle  counting  temporarily  within  cycle-counted  code.  Thus,  a  typical  nonintrusive 
addition  would  first  turn  off  cycle  counting,  then  add  the  extra  code,  and  then  turn  on  cycle  counting.7 

It  is  conceivable  that  even  with  cycle  counting  turned  off,  the  addition  may  change  the  behavior  of 
the  application.  This  is  because  the  additional  code  may  affect  the  surrounding  code  indirectly.  For 
example,  if  the  additional  code  uses  several  registers,  the  surrounding  code  may  spill  more  registers  than 
the  previous  version.  This  would  increase  the  cost  and  thus  could  change  the  behavior  of  the  system. 

We  have  rarely  observed  this  problem  in  practice;  the  addition  of  monitoring  code  to  cycle-counted 
code  has  not  caused  the  effects  being  studied  to  disappear.  Should  it  occur,  however,  it  is  possible  to 
adjust  the  cost  of  the  monitored  code  so  that  it  matches  the  cost  it  had  prior  to  the  addition.  PROTEUS 
provides  primitives  for  increasing  and  decreasing  the  cycle  counter  by  a  delta,  so  it  is  easy  to  subtract  out 
the  extra  cycles  due  to  the  monitoring  code.8  Section  8  discusses  future  work  on  nonintrusive  monitoring. 
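The mechanism described above can be illustrated with a minimal sketch. It is written in Python purely for illustration; Proteus itself provides macros for this purpose, and the names used here (CycleCounter, charge, uncounted, adjust) are hypothetical and not part of the Proteus interface. The sketch shows a cycle counter whose counting can be suspended in a nestable way, plus a delta adjustment for the rare cases where monitoring code still perturbs the cost.

```python
from contextlib import contextmanager


class CycleCounter:
    """Hypothetical stand-in for a simulated processor's cycle counter."""

    def __init__(self):
        self.cycles = 0
        self._off_depth = 0  # greater than zero while counting is suspended

    def charge(self, cost):
        """Charge `cost` cycles unless counting is currently turned off."""
        if self._off_depth == 0:
            self.cycles += cost

    @contextmanager
    def uncounted(self):
        """Suspend cycle counting; nested uses are legal."""
        self._off_depth += 1
        try:
            yield
        finally:
            self._off_depth -= 1

    def adjust(self, delta):
        """Add a (possibly negative) delta to the counter."""
        self.cycles += delta


def run(with_monitoring):
    counter = CycleCounter()
    log = []
    for i in range(3):
        counter.charge(10)                 # simulated application work
        if with_monitoring:
            with counter.uncounted():
                counter.charge(4)          # monitoring code: not counted
                with counter.uncounted():  # nesting is allowed
                    log.append((i, counter.cycles))
    return counter.cycles, log


baseline_cycles, _ = run(with_monitoring=False)
monitored_cycles, trace = run(with_monitoring=True)
```

In this sketch the monitored and unmonitored runs report the same cycle count, which is the property that nonintrusive monitoring is meant to preserve. The adjust method corresponds to the delta primitives mentioned above and would be used to subtract out any residual cost.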

5.2  Repeatability 

Nonintrusive monitoring is only useful if the platform ensures repeatability: the whole point of
nonintrusive monitoring is to allow repeatability in the presence of additional code. Repeatability is perhaps
the  single  most  important  feature  of  Proteus;  its  presence  provides  a  debugging  environment  that  is 
not  available  on  real  multiprocessor  systems. 

Nondeterministic  systems,  such  as  multiprocessors,  rarely  provide  any  form  of  repeatability;  some 
bugs  may  occur  only  once  every  ten  thousand  executions.  For  deterministic  programs,  such  as  Proteus, 
repeatability  is  the  rule  rather  than  the  exception.  Thus,  Proteus  simply  extends  the  repeatability 
inherent  in  sequential  programs  to  multiprocessor  applications. 

Given  that  Proteus  is  deterministic,  it  might  seem  reasonable  to  assume  that  it  can  reproduce  only 
one  of  the  many  possible  executions  of  a  fundamentally  nondeterministic  application.  In  fact,  however, 
Proteus  can  reproduce  multiple  executions  of  a  nondeterministic  application,  an  ability  unique  to 
Proteus  among  multiprocessor  simulators.  The  multiple  executions  arise  because  Proteus  chooses 
randomly between two requests with the same timestamp; Proteus views two such requests as a race
condition. A pseudo-random number generator is used to decide the race condition; this provides the
determinism required for repeatability. At the same time, using pseudo-random numbers implies that
changing the seed changes the outcome of some of the race conditions and thus leads to a different
execution of the same nondeterministic application.

7 Turning on and off cycle counting is done with macros that allow nesting; it is legal to embed non-cycle-counted macros
into code that already has cycle counting turned off.

8 The number of extra cycles can be determined by looking at the assembly code or by printing out the cycle counter
with and without the change. Guessing a small number would probably work as well, since the cost only needs to be
accurate enough to prevent the monitored effect from disappearing.

For  most  applications,  the  ability  to  reproduce  multiple  executions  is  not  critical.  However,  some 
applications, such as concurrent branch-and-bound search algorithms, exhibit vastly different behavior
depending  on  the  outcome  of  race  conditions.  In  the  case  of  a  concurrent  search  algorithm,  the  ability 
to  investigate  multiple  executions  allows  a  researcher  to  collect  a  distribution  of  execution  times,  which 
provides  a  much  more  accurate  view  of  the  effectiveness  of  the  algorithm.  As  expected,  some  Proteus 
applications  exhibit  a  wide  distribution  of  execution  times  when  the  random  number  seed  is  varied. 
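The interaction between determinism and seeded tie-breaking can be sketched as follows. This is an illustrative Python model, not the Proteus implementation: the function name simulate, the request format, and the use of a heap are assumptions made for the example. Requests with equal timestamps are ordered by a key drawn from a pseudo-random generator, so a fixed seed always yields the same execution while a different seed may resolve the races differently.

```python
import heapq
import random


def simulate(requests, seed):
    """Return the order in which (timestamp, name) requests are serviced.

    Requests with equal timestamps are treated as a race and ordered by a
    pseudo-random key drawn from a generator seeded with `seed`.
    """
    rng = random.Random(seed)
    heap = [(ts, rng.random(), name) for ts, name in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Three requests race at time 5; P4 and P3 have unique timestamps.
requests = [(5, "P0"), (5, "P1"), (5, "P2"), (9, "P3"), (2, "P4")]
same_seed_runs = [simulate(requests, seed=1) for _ in range(3)]
outcomes = {tuple(simulate(requests, seed=s)) for s in range(50)}
```

Running the same seed repeatedly reproduces one execution exactly; sweeping the seed yields a set of distinct executions, from which a distribution of execution times could be collected.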

5.3  Data  Collection 

The  ability  to  collect  exactly  the  desired  data  greatly  enhances  the  usefulness  of  simulation.  Proteus 
provides  a  framework  for  generating  trace  file  data  that  allows  users  to  generate  their  own  data  in 
addition  to  the  statistics  collected  by  the  engine  and  the  modules. 

The simulator uses two basic kinds of data: time-dependent and time-independent. The
time-dependent data records, called events, include a value, an index, and a timestamp. For example, a
concurrency graph can be generated using events: each point is an event consisting of the number of
busy processors and the timestamp.9 Any graph that plots something versus time uses events. The
index field is used when generating data for a set of event-versus-time graphs; Figure 2 is an example.

Time-independent  data  records,  called  metrics,  summarize  one  aspect  of  a  simulation  with  a  single 
value.  For  example,  the  execution  time  is  a  metric.  An  array  metric  is  simply  an  array  of  metrics. 
Processor  utilization  graphs,  for  example,  use  an  array  metric  with  one  metric  for  each  processor.  Metrics 
are  often  used  to  compare  the  results  of  several  simulations.  For  example,  the  nCUBE  graphs  in  Section  7 
plot  execution  time  versus  the  number  of  processors;  each  point  is  a  metric  from  one  simulation. 

In  addition  to  several  predefined  data  types,  users  can  define  their  own  event  types  and  metrics.  The 
interpretation  of  user  data  is  specified  in  a  simple  graph  language  used  by  the  graph  generator.  User- 
defined  data  types  allow  researchers  to  generate  high-quality  application-specific  graphs  in  very  little 
time.  Typically,  it  takes  only  a  few  minutes  to  define  a  new  event  type  and  specify  the  interpretation 
using  the  graph  language. 
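The two kinds of records can be sketched as simple data types. The following Python fragment is only an illustration under stated assumptions: the class names, the field named kind, and the event names "BUSY" and "IDLE" are hypothetical, and the trace contents are made up. It follows the approach in which one event marks a processor becoming busy and another marks it becoming idle, with the index field holding the processor number, and derives the points of a concurrency graph from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str        # event type, e.g. "BUSY" or "IDLE" (hypothetical names)
    value: int
    index: int       # here: the processor number
    timestamp: int


@dataclass(frozen=True)
class Metric:
    name: str
    value: float


def concurrency(events):
    """Turn BUSY/IDLE events into (timestamp, busy-processor count) points."""
    busy = set()
    points = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "BUSY":
            busy.add(ev.index)
        elif ev.kind == "IDLE":
            busy.discard(ev.index)
        points.append((ev.timestamp, len(busy)))
    return points


# Made-up trace: two processors, each busy for a while.
trace = [
    Event("BUSY", 1, 0, 0),
    Event("BUSY", 1, 1, 4),
    Event("IDLE", 0, 0, 7),
    Event("IDLE", 0, 1, 9),
]
points = concurrency(trace)
execution_time = Metric("execution_time", float(trace[-1].timestamp))
```

Because the set of busy processors is tracked explicitly, the sketch retains which processors are busy at each point, not just how many.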


9 In practice, we use two events for concurrency graphs, one that indicates a processor became busy and a second that
indicates a processor became idle. The index field contains the processor number. This allows us to determine exactly
which processors are busy; recording the count directly hides this information.
Map state_map { "Compute" 0, "Send" 1, "Receive" 2 }

ArrayGraph state (p, 0, NO_OF_PROCESSORS - 1) {
    menu <- "Processor State", ;        name in menu
    usemap <- state_map, ;              name-value map
    x_axis <- "Time", ;                 x-axis label
    y_axis <- "Processor", ;            y-axis label
    action {
        EV_STATE: VALUE(p)              use the value of events with index p
    }
}


Figure 1: Graph specification for a processor state graph. The Map statement defines a set of name-value
pairs. The ArrayGraph keyword indicates that this is an array of event-versus-time graphs: the local
variable p iterates over the valid processor numbers, and the resulting graph has one timeline for each
processor. The action clause says to ignore all but EV_STATE events; the VALUE action sets the y value
to the value of the event. The “(p)” notation indicates that the index field of the event must match the
current value of p, so that only events from the relevant processor affect the timeline for this iteration.
A graph with this specification appears in Figure 2.
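The interpretation that such a specification describes can be sketched in a few lines. This Python fragment is a hypothetical model of what the graph generator does with the trace, not its actual implementation: the function array_graph, the tuple layout of events, and the extra event type EV_OTHER are assumptions made for the example.

```python
STATE_MAP = {"Compute": 0, "Send": 1, "Receive": 2}   # mirrors the Map statement
NAME_OF = {v: k for k, v in STATE_MAP.items()}


def array_graph(events, num_processors, event_type="EV_STATE"):
    """Build one timeline per processor from (type, value, index, timestamp) tuples."""
    timelines = {p: [] for p in range(num_processors)}
    for ev_type, value, index, timestamp in events:
        if ev_type != event_type:
            continue                      # the action clause ignores other events
        if index in timelines:            # "(p)": index must match the processor
            timelines[index].append((timestamp, NAME_OF[value]))
    return timelines


# Made-up trace contents for illustration.
events = [
    ("EV_STATE", 0, 0, 0),
    ("EV_STATE", 1, 0, 12),
    ("EV_OTHER", 7, 0, 13),    # hypothetical unrelated event type
    ("EV_STATE", 2, 1, 15),
]
timelines = array_graph(events, num_processors=2)
```

Each entry of timelines corresponds to one iteration of p: only events of the selected type whose index matches that processor contribute points to its timeline.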


5.4  Graphics 

Proteus  provides  integrated  graphics  capabilities  that  are  not  available  with  comparable  simulators 
and  are  often  not  available  with  real  multiprocessors.  Proteus’  graphics  capabilities  make  it  simple  to 
evaluate  algorithms  and  architectures:  users  can  quickly  create  graphs  that  answer  their  key  questions 
and  provide  new  insight.  The  key  is  a  simple  but  powerful  graph-specification  language  that  tells  the 
graph  generator  how  to  interpret  the  trace  file. 

The  data  for  the  graphs  comes  from  the  events  and  metrics  stored  in  the  trace  file.  An  individual 
graph  specification  gives  meaning  to  the  events  and  metrics  by  determining  which  events  and  metrics 
are  relevant  and  by  specifying  how  to  build  a  graph  from  the  relevant  elements.  Figure  1  shows  a  typical 
graph  specification.  Like  most  graph  specifications  it  is  simple  and  very  short. 

The  graph  generator  produces  line  graphs,  bar  graphs,  and  tables,  and  can  combine  multiple  graphs 
onto  the  same  axes.  It  can  also  merge  data  from  multiple  simulations;  this  simplifies  comparison  of  an 
algorithm across a range of architectures, machine sizes, or other architectural parameters. The generator
uses  the  X  Window  system  and  can  produce  PostScript  hardcopy.  It  can  also  produce  PostScript  files 
for  inclusion  in  documents  such  as  this  paper. 

We have found the ability to create new graphs quickly to be an excellent debugging aid. The
most effective approach is to graph the state of each processor versus time and then combine all of
the timelines into one graph. Defining new event types, adding the data collection statements, and
specifying the graph typically require a total of about ten minutes. Many interesting effects are visible
on these graphs, including livelock and deadlock. Excessive lock holding times are readily apparent, as
are violations of mutual exclusion. In addition to debugging, these graphs are useful for program tuning
since they indicate how long different states last. Figure 2 shows one of these graphs.

Figure 2: This graph shows the state of four processors in a pipeline search tree application [CBDW91].
The state graph is generally periodic; the width of the period reveals the throughput of the pipeline. A
single operation has a “slope” of about 60 degrees from the positive time axis; the pipeline latency can
be measured directly from this slope. More importantly, this graph reveals that this particular algorithm
is spending all of its time performing communication; very little of the graph is white. A change to
buffered asynchronous message passing resolved this problem.

The data collection and display subsystem gives Proteus an unusual level of effectiveness. Users
can collect and display the data they need to answer their questions. The support for user-defined data
collection and user-specified graphs gives users of Proteus full access to the insight available through
simulation.


6  Performance 

Proteus substantially outperforms comparable multiprocessor simulators. By providing one to two
orders of magnitude improvement in performance, Proteus allows researchers to investigate applications
and machine sizes prohibited by the performance of other simulators. Table 4 summarizes the
performance of three multiprocessor simulators.
                Program Slowdown Per Processor
Simulator       Best        Typical
Proteus         2           35-100
ASIM            200         1,000-5,000
Tango           2           500-2,000

Table 4: Overall system performance for several multiprocessor simulators.

The ASIM simulator [CLN90], which was developed for the Alewife project at MIT, is a fairly
representative instruction-interpreting simulator. The overhead of instruction interpretation is reflected
in the “Best” column of Table 4, and limits the typical performance substantially.

Tango [DGH91] is very similar to Proteus in its use of direct execution with augmentation. Thus,
its peak performance has an overhead factor of about two. The typical performance, however, is far
worse than that of Proteus. This seems surprising, since Tango has similar overhead for augmentation.
In practice, augmentation overhead is an insignificant part of simulation overhead; simulating non-local
instructions and context switching dominate the cost of simulation. It is in these areas that Proteus
outperforms Tango.

Tango uses Unix processes for each simulated thread, which results in a context switch time of 180
to 250 microseconds according to the authors. Proteus uses a custom lightweight-threads package
that provides context switching times of about 3 microseconds. Even with lightweight threads, context
switching accounts for several percent of the total running time; thus, using Unix processes would greatly
reduce the performance of Proteus.10

Proteus’ lightweight threads exploit “partial” context switches if the switch occurs at a procedure
call boundary. Invariants hold at procedure boundaries that limit the amount of context that must be
saved. Because we use procedures to implement non-local instructions, it is quite common to switch at
procedure call boundaries; typically, 98% of all context switches involve the limited context.11

10 The authors of Tango are developing a version that uses lightweight threads; its performance should be much more
competitive.

11 Because of the different size contexts, we must save the size of the context to avoid excess copying when the context
is restored.

Tango uses Unix semaphores for synchronization, which further limits performance. The semaphores
used in Proteus are significantly faster. In addition, Proteus simulates spinning by internally blocking
the spinning thread, but still generating the correct network traffic. This allows Proteus to simulate
spinlock contention without suffering from contention delays itself. Tango performance drops an order
of magnitude in the presence of high contention.

There are also indirect performance benefits from running in a single address space, such as reduced
memory requirements and direct access to all parts of the simulator. In particular, the global priority
queue, which is accessed for every non-local operation, has a tuned implementation that provides access
speed that would not be possible with multiple Unix processes.

All of these decisions combine to give Proteus a level of performance that is consistently at least
an order of magnitude faster than other multiprocessor simulators. During application development,
the performance is typically two orders of magnitude better due to the performance/accuracy tradeoff
provided by Proteus’ modular structure.


7  Validation 

This section compares Proteus’ results with published results from a real multiprocessor. If the simulator
produces valid data, then its results should match those of the real multiprocessor. We have used
published results to validate Proteus several times; here we reproduce results from a comparison of
sorting algorithms on an nCUBE multiprocessor.

The nCUBE is a message-passing multiprocessor with a hypercube topology; that is, there are 2^n
processors with each processor connected to n other processors. Communication is in the style of
CSP [Hoa85]: every send must have a matching receive. The primitives transfer data blocks via DMA
over the network to the target processor. There is no cache coherence.

The algorithm comparison comes from Quinn’s paper “Analysis and Benchmarking of Two Parallel
Sorting Algorithms: Hyperquicksort and Quickmerge” [Qui89]. Quinn compares two sorting algorithms
on  a  64-processor  nCUBE/7.  Both  algorithms  mix  local  sorting  with  communication;  they  differ  in  their 
strategies  for  dividing  the  values  among  the  processors.  In  general,  quickmerge  requires  fewer  but  larger 
messages  than  hyperquicksort. 

Figure  3  graphs  Quinn’s  hyperquicksort  times  along  with  times  for  the  nCUBE  version  of  Proteus 
and  a  version  with  a  generic  network  module.  The  nCUBE  version  provides  procedures  that  implement 
the  nCUBE  communication  primitives  and  uses  costs  adjusted  to  reflect  the  actual  communication  costs 
of  the  nCUBE,  which  are  much  higher  than  those  assumed  by  the  generic  network  module.12 

12 Thanks to David Culler at the University of California at Berkeley for providing nCUBE timing data.

Figure 3: Hyperquicksort Times. (Plot not reproduced; surviving legend entries: nCUBE Proteus Hyperquicksort, Generic Proteus Hyperquicksort.)

Since we use direct execution, all of the local sorting is compiled to MIPS code, not nCUBE code.
The differences in local instructions and compilers imply that we must scale Proteus cycle counts to
correspond to nCUBE seconds. For the hyperquicksort graph, we simply picked the scaling factor that
provided the best match; thus, for hyperquicksort (only) the match between Quinn’s data and our data
is deceptively good.

The scaling factor, however, should be independent of the application, so we used the same scaling
factor for quickmerge. Figure 4 graphs Quinn’s results and Proteus’ results for quickmerge. The key
point is that although the hyperquicksort data has been scaled to fit, the quickmerge data has not: we
first established the ratio of Proteus cycles to nCUBE seconds, then we ran the quickmerge simulations.
The fact that the quickmerge data matches Quinn’s data well validates both the scaling factor and the
nCUBE version of Proteus as a whole. Figure 5 presents a different view of the quickmerge data; the
data has been normalized to Quinn’s results so that the error in individual Proteus points is more
visible.

The nCUBE Proteus results match the published results extremely well, especially when compared
to the generic network module. The modifications for the nCUBE version took less than one day to
implement, but resulted in substantially more accurate simulations; these facts confirm the importance
of the modular structure. Further refinements would improve the accuracy of the nCUBE version, but
the first-order modifications were sufficient to obtain results consistently within four percent.
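The calibrate-then-check procedure used in this section can be sketched as follows. The Python fragment is illustrative only: the numbers are made up and are not data from Quinn or from Proteus, and the least-squares fit is an assumption standing in for whatever method was actually used to pick the best-matching factor. A scale from simulated cycles to seconds is fitted on one application, then applied unchanged to a second application, and the results are normalized to the measured times so that the error is visible.

```python
def fit_scale(cycles, seconds):
    """Least-squares scale s minimizing sum((s * c - t)**2) over paired points."""
    num = sum(c * t for c, t in zip(cycles, seconds))
    den = sum(c * c for c in cycles)
    return num / den


def normalized(cycles, seconds, scale):
    """Ratio of scaled simulated time to measured time; 1.0 is a perfect match."""
    return [scale * c / t for c, t in zip(cycles, seconds)]


# Made-up numbers for illustration only; they are not data from the paper.
calib_cycles = [2.0e6, 4.0e6, 8.0e6]     # simulated cycles, calibration application
calib_seconds = [1.0, 2.0, 4.0]          # measured seconds, calibration application
check_cycles = [3.0e6, 6.0e6]            # simulated cycles, second application
check_seconds = [1.53, 2.94]             # measured seconds, second application

scale = fit_scale(calib_cycles, calib_seconds)
ratios = normalized(check_cycles, check_seconds, scale)
max_error = max(abs(r - 1.0) for r in ratios)
```

The point of the design is that the second application never influences the scale, so agreement on it is evidence about the simulator rather than about the fit.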

Figure 4: Quickmerge Times. (Plot not reproduced; surviving legend entry: nCUBE Proteus Quickmerge.)

Figure 5: Proteus error in quickmerge times. The data has been normalized to Quinn’s data to clarify
the error in the Proteus results. (Plot not reproduced; surviving legend entry: Normalized nCUBE Proteus Quickmerge.)

Evidence for the accuracy of Proteus comes from other sources as well. In our research on concurrent
search trees [CBDW91], we found that Proteus was able to reproduce published search tree
results [CS90] that were measured on a Supernode multiprocessor [Nic88].

Proteus  also  reproduced  the  results  published  in  “Synchronization  without  Contention”  by  Mellor- 
Crummey  and  Scott  [MCS91].  This  paper  compared  locking  algorithms  on  both  a  Sequent  Symmetry 
and  a  BBN  Butterfly. 

In  general,  any  effect  that  we  expected  to  see  has  actually  appeared.  More  importantly,  all  unexpected 
results  have  (so  far)  proven  to  be  real  effects  rather  than  inaccuracies  introduced  by  PROTEUS.  For 
example,  we  noticed  excessive  communication  problems  in  David  Chaiken’s  cache-coherence  protocol 
that  severely  hindered  performance  [Cha90].  In  his  thesis,  Chaiken  predicted  the  possibility  of  cache 
thrashing,  but  he  did  not  know  if  it  would  be  a  problem  in  practice.  The  solutions  he  suggested  resolved 
our  problem,  confirming  that  the  excessive  communication  was  due  to  thrashing  in  the  cache.  The 
thrashing  problems  and  solutions  were  confirmed  by  Chaiken’s  own  simulations  using  ASIM  [CLN90]. 


8  Related  Work 

Augmentation  was  first  used  to  profile  sequential  programs  by  Weinberger  [Wei84];  direct  execution  with 
augmentation  for  multiprocessor  simulation  was  developed  by  Mathieson  and  Francis  for  their  Threads 
simulator,  and  by  Covington  et  al.  for  the  Rice  Parallel  Processing  Testbed  (RPPT)  [CMM+88],  and  is 
used in several simulators [DGH91, Che89, SF89]. Section 4 discusses our extensions to this work.

Among  these  simulators,  only  the  RPPT  provides  substantial  support  for  debugging.  It  provides  some 
form  of  “parallel  debugger/tracer”  that  interprets  and  controls  the  simulation.  In  contrast,  Proteus 
was  designed  to  work  well  with  sequential  debuggers  in  addition  to  providing  a  debugging  mode  that 
interprets  the  state  of  the  simulation  and  allows  single  stepping.  Debugging  in  Proteus  is  simple  and 
straightforward,  primarily  because  we  support  sequential  debugging  techniques. 

The  support  for  integrated  data  collection  and  display  is  unique  to  Proteus  among  execution-driven 
simulators,  although  Tango  provides  some  form  of  general  monitoring.  The  CARE  simulator  [DSNB87], 
which  simulates  LISP  code  using  direct  execution  and  a  hardware  timer,  provides  integrated  monitoring 
and  graphics.  The  TESS  simulator  [Sta85],  a  commercial  discrete-event  simulation  system,  provides  very 
general  data  collection  and  display  abilities,  but  is  not  very  useful  for  multiprocessor  simulation. 

The  modular  structure  of  Proteus  extends  the  separation  of  functionality  introduced  by  Tango. 
In  Tango,  it  is  easy  to  replace  the  memory  system  simulator  as  a  whole,  but  the  cache,  network,  and 
memory  systems  cannot  be  replaced  independently.  The  RPPT  provides  several  architectural  models, 
but  does  not  seem  to  support  customization  or  independent  replacement. 

The  ability  to  trade  accuracy  for  performance  is  exploited  to  a  small  extent  by  Tango,  which  provides 
multiple versions of the memory system. Proteus makes this tradeoff a fundamental part of the
simulator. It provides multiple implementations of modules, and also provides several parameters, such
as the quantum, that trade off accuracy and performance.

The  ability  of  Proteus  to  reproduce  published  results  provides  a  level  of  confidence  in  simulation 
results  that  is  absent  in  published  results  about  comparable  simulators.  The  execution-driven  simulation 
literature  makes  no  attempt  to  reproduce  results  from  real  multiprocessors. 

Proteus also extends the performance of execution-driven simulation by combining simulation and
analytical models. The use of Agarwal’s network model as the base of one of the network module
implementations provides more than an order of magnitude increase in performance in network simulation.
Although simulation is always based on some model, our use of analytical models is novel in that we
make no attempt to simulate what actually occurs in the network. Instead, we merely attempt to
compute the correct costs for network operations. We believe that the explicit use of analytical models has
an important place in the tradeoff between performance and accuracy: when used within their limits
they provide tremendous performance and sufficient accuracy.

8.1  Future  Work 

One of the primary limitations of Proteus is the restriction that code and stacks reside in private (local)
memory.  This  assumption  prevents  Proteus  from  having  to  simulate  cache  effects  for  every  instruction 
fetch  and  stack  access.  Although  removing  this  restriction  would  greatly  reduce  the  performance  of 
Proteus,  we  would  like  to  offer  the  increased  accuracy  as  an  option. 

We  would  probably  simulate  the  cache  effects  on  a  basic-block  basis;  that  is,  each  block  would  be 
augmented  with  calls  that  simulate  the  cache  effects  for  the  instruction  fetches  and  stack  accesses  in 
that  block.  The  implementation  is  complicated  by  the  dynamic  nature  of  the  addresses:  some  of  the 
addresses  cannot  be  determined  statically. 

We  would  also  like  to  provide  some  form  of  virtual-memory  simulation.  Although  most  research 
multiprocessors  do  not  use  virtual  memory,  many  of  the  smaller  commercial  machines  do. 

Finally,  we  hope  to  implement  fully  nonintrusive  debugging.  As  described  in  Section  5.1,  there  are 
some  cases  in  which  the  “nonintrusive”  code  indirectly  affects  the  monitored  code,  usually  by  changing 
register  allocation.  We  can  eliminate  these  effects  by  automatically  setting  the  cost  of  the  monitored 
code  to  its  value  before  the  monitoring  was  added.  Thus,  the  augmentation  program  would  read  the 
previous  version  of  the  monitored  code  to  obtain  the  correct  costs,  then  it  would  adjust  the  cost  of  the 
new  version  to  be  identical,  which  makes  the  monitoring  code  truly  nonintrusive.  Since  the  current 
approach  works  most  of  the  time,  and  users  can  adjust  the  costs  themselves  in  the  cases  that  fail,  this 
change  has  lower  priority  than  the  others. 
9  Conclusion

Proteus provides a unique combination of flexibility, performance, and accuracy. Its modular structure
simplifies  customization  and  independent  replacement  of  individual  parts  of  the  simulator;  this  promotes 
modules  for  particular  architectures  and  multiple  implementations  that  provide  a  variety  of  performance 
and  accuracy  combinations.  The  division  into  independent  modules  also  clarifies  and  simplifies  each 
module,  which  makes  it  easier  to  tune  performance. 

The overall performance of Proteus is typically an order of magnitude higher than that of comparable
simulators; this is due primarily to the use of direct execution, a high-performance lightweight-threads
package,  and  efficient  simulation  of  synchronization.  When  the  high-performance  versions  of  modules 
are  in  use,  which  is  typical  during  development,  the  system  performance  increases  an  additional  order 
of  magnitude  over  other  multiprocessor  simulators. 

The  accurate  versions  of  modules  allow  PROTEUS  to  reproduce  published  results;  we  have  performed 
such  validations  several  times  in  addition  to  the  experiment  described  in  Section  7.  The  validation 
experiments  provide  a  significantly  increased  level  of  confidence  in  PROTEUS’  results.  In  general,  every 
effect  that  we  expected  to  see  has  actually  appeared,  and  every  unexpected  effect  turned  out  to  be  real. 

The  primary  use  of  Proteus  so  far  has  been  the  design  and  implementation  of  a  portable  parallel 
language  and  runtime  system.  It  has  also  been  used  for  research  on  concurrent  algorithms,  operating 
system  network  overhead,  and  fault  tolerance.  The  fault  tolerance  application  consists  of  roughly  10,000 
lines  and  runs  for  hundreds  of  millions  of  cycles. 

Proteus  provides  several  key  features  that  make  it  an  exceptional  platform  for  research  on  parallel 
systems: 

Flexibility:  Proteus  can  simulate  a  wide  variety  of  MIMD  multiprocessors,  including  both 
shared-memory  and  message-passing  machines. 

Performance:  Proteus’  performance  allows  simulation  of  applications  and  machine  sizes 
that  are  prohibited  by  other  simulators. 

Performance/Accuracy Tradeoff: By providing only the required accuracy, Proteus
maximizes  performance;  this  allows  exceptional  performance  during  development  since 
users  can  simply  switch  to  accurate  modules  when  needed. 

Repeatability:  Proteus  provides  repeatability,  which  is  critical  to  quality  debugging,  but 
rarely  available  on  real  multiprocessors.  It  lets  users  rerun  simulations  until  they  have 
pinpointed  a  problem. 
Nonintrusive Monitoring: To ensure repeatability despite the presence of additional monitoring
code, Proteus allows users to add nonintrusive monitoring code. This allows
users to gain more information without causing an effect of interest to disappear due to
changes in timing.

Use  of  Sequential  Debuggers:  Proteus  is  designed  to  work  well  with  standard  debuggers 
such  as  dbx;  this  brings  the  power  of  advanced  sequential  debuggers  to  parallel  software 
development. 

Data  Collection:  Users  can  collect  exactly  the  data  they  need,  including  user-defined  data 
types. 

Graphical Output: A simple but powerful graph-specification language allows users to create
application- or architecture-specific graphs quickly and easily.

Availability: Proteus allows parallel-systems research to take place on standard workstations,
thus avoiding the cost and limitations of real multiprocessors.

We believe that these advantages will make Proteus (and tools like it) a fundamental part of
parallel-systems research: the flexibility and the ease of development are not available on real machines.
Proteus’ high-quality development environment, combined with its flexibility, accuracy, and performance,
produces not only a high-performance simulator, but a powerful tool for parallel research and
development in general.